IGHV allele similarity clustering improves genotype inference from adaptive immune receptor repertoire sequencing data

Abstract In adaptive immune receptor repertoire analysis, determining the germline variable (V) allele associated with each T- and B-cell receptor sequence is a crucial step. This process is highly impacted by allele annotations. Aligning sequences, assigning them to specific germline alleles, and inferring individual genotypes are challenging when the repertoire is highly mutated, or sequence reads do not cover the whole V region. Here, we propose an alternative naming scheme for the V alleles, as well as a novel method to infer individual genotypes. We demonstrate the strengths of the two by comparing their outcomes to other genotype inference methods. We validate the genotype approach with independent genomic long-read data. The naming scheme is compatible with current annotation tools and pipelines. Analysis results can be converted from the proposed naming scheme to the nomenclature determined by the International Union of Immunological Societies (IUIS). Both the naming scheme and the genotype procedure are implemented in a freely available R package (PIgLET https://bitbucket.org/yaarilab/piglet). To allow researchers to further explore the approach on real data and to adapt it for their uses, we also created an interactive website (https://yaarilab.github.io/IGHV_reference_book).


INTRODUCTION
The adaptive immune system is key in fighting the diverse array of pathogens our bodies encounter. Adaptive immune receptor repertoire sequencing (AIRR-seq) is a rising approach for studying the dynamics of immune responses (1). Two crucial steps in AIRR-seq analyses are the inference of germline variable (V), diversity (D) and joining (J) allele sequences and, in B cells, the inference of clonal lineages. The immunoglobulin (Ig)-encoding genomic loci are challenging to study because of their repetitive nature and structural variants (2-4). The complex structure of the human Ig heavy chain V (IGHV) locus on chromosome 14 is illustrated in Supplementary Figure S1A.
e86 Nucleic Acids Research, 2023, Vol. 51, No. 16
A widely used taxonomy for human IG genes, which provides a common language for V, D, and J germline subgroups, genes, and alleles (2, 5), was codified by the ImMunoGeneTics Information System (IMGT) (6). This nomenclature is referred to here as the International Union of Immunological Societies (IUIS) nomenclature, because the gene names are allocated according to a process governed by the IUIS. With recent technological and algorithmic advances in the field, many previously unknown alleles and structural variants have been reported (4, 7-11). Naming these variants poses a challenge for the IUIS naming scheme, as it is not always clear how to link a germline allele sequence to a specific gene.
Germline annotation is typically performed using an aligner tool, which determines the germline allele by comparison to sequences listed in a 'germline reference set'. For V genes, the accuracy of this assignment is strongly influenced by the sequencing coverage (9, 12). Sequencing protocols that span all 290-320 base pairs of the V sequence permit the greatest accuracy, but partial coverage sequencing is often employed. Two such common protocols are BIOMED-2 (13), which utilizes primers in the framework 1 (FW1) and framework 2 (FW2) regions, and ImmunoSeq (14), which amplifies and resolves only the complementarity-determining region 3 (CDR3) and a small fragment of the V and J regions. Such partial coverage of the V region dramatically reduces the number of alleles that can be categorically resolved. This is particularly problematic as there is reduced diversity at the 3′ end of the V gene germline sequences (9, 15, 16). Even when the sequences span the whole V region, sequence alignment against the germline reference set often does not resolve a single categorical germline allele for every sequence, because of duplicated sequences within the germline reference set itself. These duplicated sequences reflect known gene duplications in the locus, and the fact that identical allele sequences can be localized to more than one gene in different individuals. In addition, B cells undergo an affinity maturation process that involves somatic hypermutations, which result in mismatches from the germline. This can cause a V sequence to become equidistant from more than one sequence in the set, again leading to multiple allelic assignments. Inference of a personal genotype is an important step, in which the set of V(D)J alleles an individual carries is inferred from the set of sequence annotations in a repertoire (17-19). This step reduces the level of ambiguity for the annotations of individual sequences, by restricting the complete set of alleles to the personal genotype.
Ambiguity in allele assignments hinders clonal inference (12, 20). Each clone stems from an ancestral naive B cell expressing an unmutated B cell receptor (BCR). In AIRR-seq analysis, it is common to infer a BCR clone first as a group of sequences that share the same V and J germline assignments and CDR3 length (21), and then cluster the sequences in this group based on similarity in the CDR3 sequence. To achieve correct clonal inference, accurately annotating the AIRR-seq data is therefore crucial, as multiple assignments or mis-assignments can result in biased clonal inference. Moreover, inferring clones based on gene annotations instead of allele annotations may obscure important information.
To address these challenges, we propose a two-component approach, illustrated in Figure 1. We first propose a new naming scheme for IGHV germline sequences. This is followed by a new genotype inference step that is based on the expression of each allele independently in the repertoire.
The current IUIS naming scheme classifies IGHV sequences by subgroup, gene, and allele. Subgroup is determined by sequence similarity and gene is determined by the ostensible order of appearance along the chromosomal locus (2). Since the order of genes along the locus is not well determined and likely varies between individuals, the IUIS names are sub-optimal for many analysis tasks. To address this, our proposed naming scheme is based on the clustering of alleles into Allele Sequence Clusters (ASCs) solely on the basis of similarity. These ASCs take the place of genes in the analysis. The intention here is not to replace the IUIS system in reporting analysis results, but rather to offer a method of data representation that is more tractable for analysis purposes.
The second part of our proposed approach is aimed at addressing the challenge of inferring personal genotypes. Current tools for genotype inference adopt a gene-based approach to determine the presence of an allele. This approach is based on calculating the relative frequency of an allele, which is then normalized by the total number of sequences assigned to the corresponding gene (17, 22, 23). This method is hindered by allele similarity, as many sequences are assigned to multiple alleles from different genes. Here, we propose a new genotyping approach, based on the consideration of the overall frequency of each allele normalized by the total number of sequences in the whole repertoire. Our method takes advantage of the new proposed naming scheme that collapses identical sequences into a single name. We show that our allele-based genotyping approach provides improved results compared to existing tools and is in excellent correspondence with genotypes derived from genomic sequencing.

Data
We used heavy chain BCR repertoire data, sequenced from naive and non-naive B cells from individuals of three VDJbase (16) projects: P1 (naive), P11 (naive), and P4 (non-naive). Library preparation and processing for projects P1 and P11 were performed as described in (24). The processing for project P4 repertoires is described in (25). For the current study, we downloaded the V, D and J allele reference set from IMGT (imgt.org) in July 2022, including the functionality annotation for each allele. Non-functional alleles from the V reference set were discarded, resulting in the exclusion of subgroup IGHV8, as none of the alleles in this subgroup are functional. Hence, the final V reference set used here included only alleles from subgroups IGHV1-7.

Figure 1. [...] and Genotypes (PIgLET) receives as input a reference set, and infers clusters using hierarchical clustering based on sequence similarity. Then an ASC-annotated reference set is obtained. The next step is re-annotating an AIRR-seq dataset with the reference set. The output of this process is an ASC-annotated AIRR-seq dataset (marked in orange). The right upper panel shows the ASC-based genotyping approach of PIgLET. PIgLET receives as input the ASC-annotated AIRR-seq data. The genotype is inferred based on allele-specific thresholds from the IGHV reference book. Finally, a genotype inference is generated (marked in green). The bottom panel outlines possible downstream analyses that can be conducted using PIgLET's output, with improved performance.

Allele similarity clusters
To create the allele similarity clusters (ASCs), we used the most recent available IGHV reference set from IMGT, with the addition of undocumented allele sequences not found in the IMGT germline reference set, inferred from both P1 and P11. The combined set was then filtered to include only functional alleles that start from the first position of the V sequence region as defined by the IMGT numbering scheme and extend through at least position 318. This resulted in 280 unique allele sequences. We next trimmed the 3′ ends of any germline sequences longer than 318 bp. Our objective here was to maximize sequence lengths while facilitating accurate comparisons of sequences with equivalent sequence content. Position 318 was selected as the optimum based on the fact that, in AIRR-seq data, nucleotide variation beyond this position is impossible to reliably detect (8). Following trimming, the germline reference set consisted of 278 unique alleles, which were then used to generate ASCs. For clustering, we calculated the Levenshtein distance between all allele pairs up to position 318, followed by a hierarchical clustering step with complete linkage. The resulting tree was cut based on two similarity thresholds of 75% and 95%, to obtain the allele families and ASCs, respectively. In setting the 95% similarity threshold, there was a trade-off between the overall number of clusters and the number of clusters that contain alleles from distinct genes. To set the threshold, we plotted two evaluation metrics as a function of the similarity threshold: first, the number of genes spanning more than one cluster, normalized by the total number of genes; second, the number of clusters that contain alleles from distinct genes, normalized by the total number of clusters (Supplementary Figure S2). The plots show that the optimal range for the threshold is between 94% and 96%. For simplicity, we chose 95%, representing the middle of this range. Critically, the optimal threshold may vary depending on the germline reference set used. Thus, when additional human alleles are discovered and incorporated, this threshold may need to be adjusted; additionally, implementation of the ASC approach for other species will require optimization of germline reference set trimming and clustering thresholds depending on the input data.
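The clustering step described above can be illustrated with a short Python sketch that computes pairwise Levenshtein distances on trimmed sequences and merges clusters under complete linkage until no merge stays within the similarity cut. This is a minimal sketch and not the PIgLET implementation: the function names, the normalisation of the distance by the longer trimmed sequence, and the 20-nucleotide toy sequences are assumptions made for the example.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def cluster_alleles(seqs, similarity, length=318):
    """Complete-linkage clustering, stopped at the given similarity cut."""
    trimmed = {name: seq[:length] for name, seq in seqs.items()}
    # Assumption: distance is normalised by the longer trimmed sequence.
    dist = {
        frozenset((a, b)): levenshtein(trimmed[a], trimmed[b])
        / max(len(trimmed[a]), len(trimmed[b]))
        for a, b in combinations(trimmed, 2)
    }

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    clusters = [{name} for name in trimmed]
    cut = 1.0 - similarity
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[i], clusters[j]) > cut:
            break  # every remaining merge would exceed the cut
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


# Toy sequences (hypothetical, 20 nt) clustered at a 90% similarity cut.
toy = {
    "a1": "ACGTACGTACGTACGTACGT",
    "a2": "ACGTACGTACGTACGTACGA",
    "b1": "TTTTGGGGCCCCAAAATTTT",
}
toy_clusters = cluster_alleles(toy, similarity=0.9, length=20)
```

Calling the same function with `similarity=0.75` and `similarity=0.95` on a real reference set would mirror the two cuts used for families and ASCs, although tie handling at the cut may differ from the tooling used in the study.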
As a result of the clustering, the alleles were renamed to represent the new allele families and ASCs (Supplementary Table S1). For example, for the allele IGHVS1F2-G15*02, the family is represented by F2, the ASC by G15, and the allele by 02. S1 is an indicator of the library amplicon length of a given reference set. A key table that links the IUIS naming scheme to the ASC naming scheme can be found in Supplementary Table S1. The reference set with the new naming scheme was then used for downstream processing.
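To make the structure of the new names concrete, the following Python sketch splits a name such as IGHVS1F2-G15*02 into its amplicon, family, ASC and allele parts. The regular expression and the helper name `parse_asc_name` are assumptions for illustration; names of undocumented alleles that carry extra suffixes are not handled, and the small lookup table contains only two correspondences quoted in this paper, not the full key of Supplementary Table S1.

```python
import re

# Pattern for names such as IGHVS1F2-G15*02 (amplicon, family, ASC, allele).
ASC_NAME = re.compile(
    r"^IGHV(?P<amplicon>S\d+)(?P<family>F\d+)-(?P<asc>G\d+)\*(?P<allele>\d+)$"
)


def parse_asc_name(name):
    """Split an ASC-style allele name into its labelled components."""
    match = ASC_NAME.match(name)
    if match is None:
        raise ValueError(f"not an ASC-style IGHV allele name: {name!r}")
    return match.groupdict()


# Two ASC-to-IUIS correspondences quoted in this paper; the complete key is
# Supplementary Table S1.
ASC_TO_IUIS = {
    "IGHVS1F2-G15*01": "IGHV3-64*02",
    "IGHVS1F2-G15*02": "IGHV3-64*01",
}

parts = parse_asc_name("IGHVS1F2-G15*02")
```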
It is worth mentioning a potential issue that concerns all AIRR-seq based genotype inference approaches: in some rare cases, two alleles differ only at the 3′ end of the sequence. In human IGHV, this means beyond position 318 in the IMGT numbering scheme. This causes many instances of multiple assignments, as the aligners cannot differentiate between the two alleles when the rearrangements have been trimmed upstream of position 318 during the rearrangement process. In human IGHV, only two such cases exist: 3-66*01 and 3-66*04, and 4-28*01 and 4-28*03. These cases should be treated separately, considering all particularities of the sequences, and should be reported with an adequate confidence level (Supplementary Table S1). Hence, for the downstream analysis in this paper, to avoid effects on the alignment process, V germline sequences that were trimmed for clustering were extended to their original length. As alleles 01 and 04 of IGHV3-66 and alleles 01 and 03 of IGHV4-28 only differ at position 319, they were collapsed in the cluster analysis. Hence, in the reference set we have decided to include both alleles for both genes.

ASC-based genotype method
The ASC-based genotype utilizes an empirically derived threshold to determine the presence of a given allele within an individual's genotype. We first set a default threshold (10⁻⁴) for the absolute allele usage fraction before tailoring it to each allele. The choice of 10⁻⁴ was based on the current typical depth of AIRR-seq data, which in our datasets was between 10K and 20K sequences. At these depths, the threshold of 10⁻⁴ corresponds to sequences that appear once or twice in a given sample. We then tuned the individual thresholds, from observations of the allele's usage across all available naive B cell samples present in VDJbase (16), to maximize the consistency between genotypes inferred using the ASC-based approach and information obtained by haplotype inference and genomic long-read data. Overall, 129 of 280 thresholds were adjusted. A list of all thresholds applied can be found in Supplementary Table S1.
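The relation between sequencing depth and the default threshold is simple arithmetic, shown here as a small Python sketch. It uses exact fractions so that the result is not affected by floating-point rounding; the function name is an assumption for illustration.

```python
from fractions import Fraction

DEFAULT_THRESHOLD = Fraction(1, 10_000)  # 10^-4, the default stated in the text


def count_at_threshold(depth, threshold=DEFAULT_THRESHOLD):
    """Number of sequences whose usage fraction equals the threshold."""
    return depth * threshold


# The two ends of the depth range quoted in the text.
counts = {depth: count_at_threshold(depth) for depth in (10_000, 20_000)}
```

At 10K and 20K sequences the threshold corresponds to one and two sequences, respectively, which is the "once or twice" stated above.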
The thresholds listed in Supplementary Table S1 were determined from a cohort of northern European individuals. Hence, for cohorts with different ethnic backgrounds or from other species, adjustments to the allele thresholds might be required. Therefore, to determine a new allele-based threshold for a different population, users should take the following steps. First, the new AIRR-seq data should be aligned with the corresponding ASC germline reference set. Second, for each allele, the fraction of the allele usage from the total number of sequences should be derived. Third, if available, haplotype information should be extracted, and coupled with the allele usage from step two. Haplotype information can be obtained using an anchor gene such as IGHJ6, by separating the allele usage based on rearrangement with the anchor gene. Fourth, the default threshold should either be set to 10⁻⁴, or start from the determined allele-based threshold from Supplementary Table S1. Fifth, alleles should be grouped based on their ASCs, observing their usage against the starting allele-based threshold. The threshold should be adjusted in the following cases:
• The allele is expressed at levels significantly higher than the default threshold in the study population. In this case, the threshold needs to be increased, but remain below the lowest expression level amongst the studied individuals.
• Haplotype evidence supports an allele expressed below the default threshold. Here, we seek evidence that the allele is present on either one or both chromosomes, and if found, we adjust the threshold to include individuals expressing these alleles at low levels. However, in cases where we have specific information on the allele's genomic location, such as with IUIS genes, it is necessary to ensure that there are no contradicting haplotype results. For instance, two alleles from the same IUIS gene should not appear on the same chromosome. If a contradiction is found, we maintain the current threshold until we gather more information on the possibility of gene duplication.
• The study population expresses an undocumented allele. If an undocumented allele was inferred to be present in the study population, it should first be assigned the threshold of its closest allele, and then adjusted as in the two cases above.
The cases we presented here are a guideline for determining the allele-based thresholds and adjusting from the default values.
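The third case, in which an undocumented allele starts from the threshold of its closest allele, can be sketched in Python as follows. The sketch assumes that "closest" means the smallest number of mismatches between equal-length sequences; the allele names, sequences and threshold values are hypothetical and chosen only to make the example runnable.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def starting_threshold(novel_seq, reference, thresholds, default=1e-4):
    """Return the closest reference allele and the threshold to start from."""
    closest = min(reference, key=lambda name: hamming(novel_seq, reference[name]))
    return closest, thresholds.get(closest, default)


# Hypothetical reference alleles and allele-specific thresholds.
reference = {"X*01": "ACGTACGT", "X*02": "ACGTTCGT"}
thresholds = {"X*01": 1e-4, "X*02": 1e-3}

closest, start = starting_threshold("ACGTTCGA", reference, thresholds)
```

The returned value is only a starting point; as described above, it would then be adjusted using usage and haplotype evidence.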

IGHV reference book
To further refine the thresholds for specific populations of interest, we developed an interactive web server (https://yaarilab.github.io/IGHV_reference_book/) that presents the ASCs and the absolute frequencies of the alleles for datasets P1 and P11. Further, the server presents the chosen allele-specific thresholds shown in Supplementary Table S1. The server allows the end user to explore different choices for allele thresholds. This can be done using the interactive app on each ASC page, allowing the thresholds to be controlled by designated numeric input buttons. The app includes an interactive graph that displays the frequencies and thresholds. The interactive graph allows the user to explore haplotype information, if available, by clicking the individual's point on the graph. Once the user adjusts any of the thresholds, the app reloads with the adjusted threshold and presents the resulting genotypes. Modification of thresholds can be helpful in maximizing discrimination, particularly in cases where some alleles have lower expression levels than others (15, 24, 26, 27).

AIRR-seq processing
To infer a personal genotype either by the ASC-based or the gene-based method (17), gene/allele assignments for each repertoire were determined using IgBLAST (V1.17) with the customized ASC germline reference set (DOI: 10.5281/zenodo.7401239). Then, for each clone, a representative with the fewest mutations was chosen, undocumented alleles were inferred using TIgGER (17), and in cases for which new alleles were found, reassignment was carried out. If the dataset came from naive B cells, the sequences were filtered for no mutations within the V region up to position 316, accounting for possible sequencing errors at the end of the V region (24). For repertoires coming from full V region length amplicons, the repertoires were filtered to omit 5′-trimmed sequences. Sequences were also filtered based on whether there was sufficient 3′ coverage of the V region, requiring sequences to be at least 312 nucleotides long. For ASC-based inference, each allele's absolute usage was calculated, and in cases in which a given sequence had more than a single assignment, the counts were divided among all clusters. For each allele within the repertoire, the absolute usage was compared to the specific threshold. Alleles that passed the threshold were then added to the individual's final genotype. For the gene-based genotype inference, TIgGER's 'inferGenotype' (17) function was used with the chosen threshold, either 12.5% or 5%.
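The ASC-based inference rule described above can be sketched in Python as follows. This is not the PIgLET code: the function name is hypothetical, assignments are assumed to arrive as comma-separated strings (as in an AIRR-style `v_call` field), a multiply assigned sequence is assumed to contribute an equal share of its count to each assigned allele, and "passing" is taken to mean strictly exceeding the threshold. The allele names and counts are invented.

```python
from collections import defaultdict


def asc_genotype(v_calls, thresholds, default=1e-4):
    """Alleles whose absolute usage fraction exceeds their own threshold."""
    usage = defaultdict(float)
    for call in v_calls:
        alleles = call.split(",")
        for allele in alleles:
            # Assumption: ambiguous assignments share the count equally.
            usage[allele] += 1.0 / len(alleles)
    total = len(v_calls)
    return sorted(
        allele for allele, count in usage.items()
        if count / total > thresholds.get(allele, default)
    )


# Hypothetical repertoire of 10,000 sequences.
v_calls = (["A*01"] * 9000 + ["A*02"] * 991
           + ["A*01,A*03"] * 4 + ["B*01"] * 5)
# A*03 has a tailored, stricter threshold; the others use the default.
genotype = asc_genotype(v_calls, thresholds={"A*03": 1e-3})
```

Here A*03 reaches a usage of 2 × 10⁻⁴, which would pass the default threshold but fails its allele-specific one, whereas B*01 at 5 × 10⁻⁴ enters the genotype.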

Using genome long-read assemblies to validate alleles
Long-read assemblies from six samples generated using Sequel IIe HiFi reads and IGenotyper v1.0.0 (4) were used to validate the ASC-based genotype approach. The samples are those described previously (28, 29). The assemblies were aligned to a custom immunoglobulin heavy chain (IGH) genome reference containing previously discovered IGHV genes using BLASR (4, 30). From the alignments, fully phased gene sequences were extracted, including 5′ UTR, leader-1, leader-2 and exons. In 2 cases where a fully phased gene sequence could not be extracted, the assembly was reviewed manually and a haplotype was determined on the basis of the coding region only. The matched repertoire sequences were processed as described in the 'AIRR-seq processing' subsection of the Methods. Because the sequences were pre-sorted to IGM, we inferred the genotype only for unmutated sequences (after inferring novel alleles).
The comparison between the genomic annotation and the ASC-based genotype was made based on germline sequence similarity.

Allele naming system based on germline hierarchical structure
Using hierarchical clustering with complete linkage, we defined a two-level naming scheme for the set of functional germline alleles (downloaded from IMGT July 2022): allele families and ASCs. For the family level, we followed the logic and threshold of 75% nucleotide similarity from IMGT (31). Since we applied this methodology to the contemporary set of functional alleles, the resulting families mildly deviate from the IUIS family definitions. In particular, the IGHV3 subgroup is split into two families with our approach (Figure 2A, the orange dashed circle defines the 75% threshold). Using the same hierarchical tree, we clustered the sequences based on 95% nucleotide similarity (Figure 2A, blue dashed line). This resulted in 46 clusters, which we defined as ASCs, some of which consisted of several genes (Figure 2B). In addition, the alleles of some genes were split between different clusters. Adopting the two-level naming scheme resulted in an annotated germline reference set that reduced ambiguities in several analysis steps, as shown below. Moreover, as mentioned in the introduction, partial sequencing protocols exacerbate the computational challenges associated with the assignment of highly similar alleles originating from distinct genes. Our proposed naming scheme can be generalized in a straightforward way to these situations. To adapt the above naming scheme to partial V sequences, we computationally trimmed the 5′ region of V sequences in the germline reference set according to sequence lengths obtained using the BIOMED-2 (13) and ImmunoSeq (14) protocols. For simplicity, we defined the sequencing protocols by the library amplicon length, and named the full-length amplicons 'S1', the partial V sequences corresponding to the BIOMED-2 style 'S2', and the minimal V coverage of ImmunoSeq 'S3' (Figure 2A).
Depending on the amplicon length used, we obtained a different number of ASCs. As expected, the 5′ V trimming resulted in higher similarity between the alleles. Compared to the 54 genes in the IUIS database, after clustering we observed 46 ASCs in S1, 43 in S2 and 11 in S3 (Figure 2C).

ASC-based thresholds enhance genotyping accuracy
Many computational genotyping tools consider the relative frequency of a candidate allele during inference and filtering steps (17, 22, 32, 33). An inference is made or accepted if the number of assignments to an allele exceeds a threshold percentage of the total assignments to all alleles of the corresponding gene. A need for allele-specific filtering processes has been raised previously (27). Here, we implemented a method based on the explicit comparison of each allele's frequency in the repertoire under study with an allele-specific threshold. In the presented method, an allele enters the genotype if its absolute frequency exceeds its allele-specific threshold. The genotype inference process was done using the ASC naming scheme, such that identical nucleotide sequences from different chromosomal positions were collapsed into a single allele annotation. This addressed issues that can confound frequency observations in current methods: variable expression levels between alleles of the same gene, multiple assignments, duplicated genes, and short reads.
We started with a default value for the allele-specific threshold (10⁻⁴), and assessed it allele by allele by comparing the outcome of the genotype for this value with the picture that emerges from a complementary haplotype inference (see Methods). After reviewing and, where necessary, adjusting the allele-specific thresholds for all the IGHV alleles observed in two naive cell datasets, VDJbase projects P1 (24) and P11 (PRJEB58016), 142 repertoires in total, we compared the resulting inferred genotypes with the ones inferred by TIgGER, which is a gene-based inference tool (Figure 3A for P1, and Supplementary Figure S3 for P11). In the ASC-based genotype approach, alleles enter the genotype if their usage is higher than the allele-specific thresholds. In TIgGER's genotype inference, on the other hand, the alleles enter the genotype based on the relative usage normalized by all sequences mapped to this gene. A common step in this kind of analysis includes an undocumented allele inference. In the ASC-based genotype inference, each inferred undocumented allele is given the allele-specific threshold of its most similar allele. Overall, there were 5767 allele calls that were included in either one or both genotypes. Results were concordant for inference of highly used alleles, with 5548 allele calls that fully matched between the methods (~96%, green squares). However, there were 3 allele calls that only entered the genotype with TIgGER (pink squares), and 216 alleles that were called only by the ASC-method (black squares).
The potential false positives in genotypes inferred by TIgGER (17) were seen in cases where all observed alleles of a particular gene were expressed at low levels, according to population data. An example is IGHV7-4-1 (part of ASC IGHVS1F4-G21). In all individual genotype inferences, there was not a single situation of heterozygosity for this cluster, as in most individuals there was one dominant allele. In the single occasion where heterozygosity was declared, both alleles *01 and *02 entered the genotype. However, the inference of allele *02 is likely to be incorrect. In this particular sample (VDJbase: P1_I44_S1), the typically poorly expressed allele *01 had 4 sequences, while the typically highly expressed allele *02 had only a single sequence. This deviates from what is seen in the population. The three alleles attributed to this cluster vary in usage, with allele *02 being the most expressed allele (27) with a median usage of 1.19 × 10⁻². This is 33 times more than the second most expressed allele (*01). Hence, the situation where allele *01 dominates over allele *02 is unlikely (P value of 4 × 10⁻⁶ according to a binomial test), and the identification of a read associated with allele *02 might be the result of a mutation or a PCR or sequencing error. This indicates clear deviations between the approaches that may lead to different specificities in clusters with low expression.
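The order of magnitude of the binomial P value quoted above can be reproduced with a short Python sketch. The exact parameterisation of the test is not stated in the text, so the following is an assumed reconstruction: if allele *02 is used 33 times more than allele *01, a read from this pair is expected to come from *01 with probability 1/34, and the question is how likely it is to see at least 4 of the 5 observed reads assigned to *01.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumption: *02 is 33 times more used than *01, so P(read is *01) = 1/34.
p_allele_01 = 1 / 34
# Observed in the sample: 4 reads of *01 and 1 read of *02 (5 reads in total).
p_value = binom_upper_tail(4, 5, p_allele_01)
```

Under these assumptions the tail probability is 166/34⁵, roughly 3.7 × 10⁻⁶, which is consistent with the reported value of about 4 × 10⁻⁶.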
Potential false negatives in the genotypes inferred by TIgGER (17) are seen in cases where one allele is expressed at a lower rate than other alleles of the gene, according to population data. An example is IGHV3-64*02, corresponding to ASC IGHVS1F2-G15*01. This allele entered the genotype using the ASC-based method, but not using the conventional TIgGER methodology (17). The IGHVS1F2-G15 cluster combines alleles from two IUIS genes, IGHV3-64 and IGHV3-64D, which merge under the 95% threshold. The alleles of this cluster vary in usage, i.e., alleles *05, *06, and *07, typically assigned to IGHV3-64D, are more frequently used than *02 and *01 (Figure 3B). Allele IGHVS1F2-G15*01 (IGHV3-64*02) is expressed at a considerably lower level than the other alleles, with a median of 2.1 × 10⁻⁴ absolute frequency: roughly 12 times lower than the second most lowly expressed allele, IGHVS1F2-G15*02 (also known as IGHV3-64*01). Despite the fact that allele IGHVS1F2-G15*01 was above the ASC-based default threshold (10⁻⁴) and as such entered the genotype, it is far below TIgGER's (17) relative fraction threshold of 12.5% or 5%, as it has a median relative frequency of 1.86% (Figure 3B). This raised the question of whether the alleles with low expression truly exist. To validate the inference of allele IGHVS1F2-G15*01, we looked at the haplotype of alleles IGHVS1F2-G15*01 and IGHVS1F2-G15*02, since they come from the same chromosomal location (IGHV3-64). We haplotyped seven individuals who ostensibly included allele *01 with the ASC-method but not in TIgGER (17), using heterozygosity at IGHJ6 (15) as the anchor (Figure 3C). In all seven individuals, alleles *01 and *02 were found on opposite chromosomes, strongly supporting the presence of allele IGHVS1F2-G15*01. This example demonstrated the sensitivity of the ASC-based approach to inferences of alleles with low expression, which may provide important insights for future studies.
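The contrast between the two decision rules for this allele can be written out as a small Python sketch. The counts below are hypothetical and were chosen only to reproduce the median frequencies quoted above (2.1 × 10⁻⁴ absolute and about 1.86% relative); the relative rule is a simplified stand-in for a gene-normalised cutoff and does not reproduce the internals of TIgGER's 'inferGenotype'.

```python
def passes_absolute(allele_count, repertoire_size, threshold=1e-4):
    """Allele-based rule: usage is normalised by the whole repertoire."""
    return allele_count / repertoire_size > threshold


def passes_relative(allele_count, gene_count, fraction):
    """Gene-based rule: usage is normalised by reads assigned to the gene."""
    return allele_count / gene_count > fraction


# Hypothetical counts matching the quoted medians:
# 21 / 100,000 = 2.1e-4 (absolute) and 21 / 1,129 is about 1.86% (relative).
allele_count, gene_count, repertoire_size = 21, 1129, 100_000

in_asc_genotype = passes_absolute(allele_count, repertoire_size)
in_gene_genotype_5 = passes_relative(allele_count, gene_count, 0.05)
in_gene_genotype_12_5 = passes_relative(allele_count, gene_count, 0.125)
```

With these numbers the allele passes the absolute rule but fails both relative cutoffs, which is the pattern described for IGHVS1F2-G15*01.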
Figure 3D summarizes the distribution of allele prevalence in cohorts P1 and P11. Seven out of the 280 alleles present in the ASC germline reference set appeared in all 142 individuals, while 41% of the alleles, 116 out of 280, did not enter any of the genotypes. This implies that reference sets should potentially be population-specific (34, 35), or that the current reference set includes a large fraction of unexpressed or non-existent alleles (36).

Allele usage reporting
Subgroups, genes, and sometimes alleles are commonly used as AIRR-seq features, for example in reporting overexpression of specific genes or families in the context of specific diseases. These features are highly sensitive to the nomenclature and the genotypes of the individuals in the cohort. Here, we compared the reporting of allele-level usage versus gene or cluster level. Supplementary Figure S4 shows that reporting of usage was highly influenced by the genotype of the individuals. Figure 4 shows an in-depth comparison for cluster IGHVS1F2-G5, for which the observed mean absolute usage in individuals who carry alleles *04 and *05 was significantly higher than in those who carry *03 and *04 (Figure 4A). If we had presented the usage results in an aggregated manner, however, disregarding the genotype allele combination attribute, this variation would have been masked (Supplementary Figure S4, left upper panel). Moreover, if IUIS gene names were used to report the usage of these alleles, it would have been split between the V3-43 and V3-43D columns. Consequently, when studying allele usage in human cohorts, we recommend that usage is reported at the allele or ASC levels, to avoid unnecessary ambiguities.

Figure 3. [...] (17) and from the ASC-based method. The bottom panel is the heatmap comparison, where each row is a genotype inference of an individual from the P1 dataset and each column is a different allele. The alleles shown in the heatmap are those for which at least one of the methods inferred a genotype. Black and pink colors represent alleles that only entered the genotype either in the ASC-based method or in TIgGER (17) with a 12.5% threshold, respectively. Green represents alleles that entered the genotype in both methods, and white represents alleles that did not pass in either method. The top panel is the summation of the heatmap events. The y-axis is the count of the individuals for whom a given allele entered the genotype. The x-axis is the different alleles. The colored strip bar shows each allele's ASC group and matches the colors shown in Figure 2. (B) The relative and absolute frequencies of the ASC IGHVF2-G15. Each dot is an individual for whom allele 01 entered the genotype with the ASC-based method, but did not in TIgGER (17). The colors represent the different individuals. Each column is a different allele from the cluster. The top row is the absolute frequency and the bottom is the relative frequency. (C) Haplotype inference based on IGHJ6 for the individuals from (B). Each facet is a different individual, and the facet color matches the dots from (B). In each facet, the top row and orange color is the frequency for the IGHJ6*02 chromosome and the bottom and blue color for the IGHJ6*03 chromosome. The x-axis is the different alleles for the cluster, and the y-axis is the sequence count. (D) A histogram of the allele abundance distribution in the studied population. The x-axis is the number of individuals attributed to each allele, and the y-axis is the number of alleles.

Genomic validation of the ASC-based genotype
We validated our ASC-based genotyping method using a paired dataset drawn from six subjects, comprising full-length AIRR-seq repertoire sequencing of IGM from RNA isolated from PBMCs, and haplotype-partitioned assemblies of the genomic IGHV locus derived from long-read sequencing (29). Across the six subjects (Figure 5), a total of 304 ASC allele calls were made from the AIRR-seq repertoires, counting the inference of a single allele in a single individual as an allele call. The comparison between the ASC allele calls and the genomic annotation was made based on sequence similarity. In three subjects, genomic assemblies were not fully resolved using IGenotyper: either the assemblies did not span certain genes, or diploid assemblies representing both haplotypes were not resolved for all genes. This resulted in an inability to validate 4 of the 304 inferences made from the AIRR-seq analysis. However, with manual examination of the assemblies we were able to validate them (Figure 5, light purple). For the remaining 300 inferred allele calls, 299 were concordant between the ASC and genomic results (light green squares), meaning that overall we were able to match 303 of the 304 calls with the genomic assemblies (>99.6%). The single discordant allele was IGHVS1F8-G39*03 (IGHV4-34*01) in subject SC-19 (dark green square); in this case, an undocumented IGHV4-34 allele was annotated from the genomic data, characterized by novel SNPs located at the very 3′ end of the V region (positions 319 and 320). The 3′ end of the V region is often trimmed upon recombination; hence, as mentioned above, inferring an undocumented allele from AIRR-seq repertoire sequences based on these positions is unreliable (7).
A peculiar example involved the inferred allele IGHVS1F8-G36*03 (IGHV4-59*08). This allele has been speculated to reside at gene IGHV4-61 (27, 37, 38). Directly supporting this, our comparison showed that the germline sequence of the allele perfectly matched the long-read assemblies spanning IGHV4-61. Thirty-one allele calls were found only in the genomic samples (dark purple squares) and not in the ASC genotypes, implying that these alleles are poorly or not at all expressed. Such examples have been described in the literature (24, 27, 39).
In summary, out of the 304 allele calls that were made across six individual genotypes using the ASC-based method, we found potential contradictions from the genomic data only in two cases. These cases most likely indicate technical issues with the genomic assembly due to reduced coverage, rather than problems with the ASC-based genotype inference method.

Generalizability to other germline reference sets
One potential limitation of the proposed naming scheme is that the specific alleles in the germline reference set determine the allele families and ASCs; hence the clustering may change when alleles are added to or removed from the set. To quantify the impact of an altered germline reference set, we created a reduced germline reference set consisting only of the alleles that entered the genotype of P1 individuals, as determined by our ASC-based method. This is an example of transferring a germline reference set from one dataset to another without adjusting it. We then applied the clustering algorithm and obtained the new families and clusters (Figure 6A). Compared to the original set, two cluster pairs were merged, G36/G37 and G43/G44, and G13 and G38 were dropped, as none of their alleles entered any of the genotypes. As shown in Figure 6A, the overall structure of the clusters was maintained apart from the relatively minor changes detailed above, despite the drastic reduction from 280 to 164 alleles (Figure 6B). From this, we conclude that the clustering method is relatively robust to changes in the reference set composition.
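The two-threshold clustering can be illustrated with a minimal Python sketch. It is not the PIgLET implementation: the complete-linkage merging rule, the toy 20-nt sequences, and the allele labels are assumptions made for illustration. Only the 75% and 95% similarity thresholds are taken from the text.

```python
from itertools import combinations
from typing import Dict, List, Set


def similarity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(seqs: Dict[str, str], threshold: float) -> List[Set[str]]:
    """Greedy agglomerative clustering with complete linkage (an assumption).

    Two clusters are merged only if their least similar pair of members
    still reaches the threshold; the most similar eligible pair goes first.
    """
    clusters = [{name} for name in sorted(seqs)]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            link = min(similarity(seqs[a], seqs[b])
                       for a in clusters[i] for b in clusters[j])
            if link >= threshold and (best is None or link > best[0]):
                best = (link, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]


# Hypothetical toy reference set (labels and sequences are invented).
a1 = "ACGT" * 5
toy = {
    "A*01": a1,
    "A*02": a1[:-1] + "A",          # 1 mismatch to A*01 (95% similar)
    "B*01": "TCGT" * 4 + "ACGT",    # 4 mismatches to A*01 (80% similar)
    "C*01": "G" * 20,               # unrelated sequence
}

families = cluster(toy, 0.75)  # family level
ascs = cluster(toy, 0.95)      # allele similarity cluster level
print(families)
print(ascs)
```

Under complete linkage every pair of alleles inside a cluster meets the threshold, so adding or removing a single allele can merge or split clusters, which is the kind of sensitivity the reduced reference set experiment probes.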
To further assess the flexibility and effect of the reference set, we tested the multiple assignments in a non-naive repertoire. Multiple assignments are cases in which the aligner, IgBLAST in our evaluations, cannot determine a single matched allele, and outputs multiple options for the most likely germline allele. This can be caused by sequencing errors, somatic hypermutation, identical germline alleles shared by multiple genes, or a combination of these. We explored this effect using the P4 dataset from VDJbase, which included non-naive repertoires from 28 individuals. We aligned the repertoires three times: once with the IUIS gene definitions downloaded from IMGT (the IMGT set), once with an identical set of sequences but using the proposed assignment nomenclature (the S1 set), and once with the reduced germline reference set described above (the reduced S1 set). We calculated the fraction of sequences that were attributed by the aligner to more than a single gene/ASC. Figure 6C shows an expected, significant 3-fold reduction in multiple assignments between the IMGT set and the S1 set. The reduced S1 set showed a further reduction in multiple assignments.
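The multiple-assignment fraction described above can be sketched in a few lines of Python. This is an illustration only, assuming that ties are reported as comma-separated allele calls (as in AIRR-format `v_call` fields) and that the gene or ASC label is the part of each call before the `*`. The function names and the toy calls are hypothetical.

```python
from typing import Iterable


def group_of(allele_call: str) -> str:
    """Return the gene/ASC label of an allele call, e.g. 'X*01' -> 'X'."""
    return allele_call.strip().split("*")[0]


def multiple_assignment_fraction(v_calls: Iterable[str]) -> float:
    """Fraction of sequences whose call spans more than one gene/ASC."""
    calls = list(v_calls)
    if not calls:
        return 0.0
    multi = sum(
        len({group_of(c) for c in call.split(",")}) > 1
        for call in calls
    )
    return multi / len(calls)


# Invented calls: the same four reads annotated under two naming schemes.
# In the gene-based scheme the first read ties between two duplicated genes;
# in the cluster-based scheme both candidates fall in one cluster.
gene_based = ["GENE_A*01,GENE_AD*01", "GENE_A*01", "GENE_B*01,GENE_B*02", "GENE_C*01"]
cluster_based = ["CLUSTER_1*01,CLUSTER_1*02", "CLUSTER_1*01", "CLUSTER_2*01,CLUSTER_2*02", "CLUSTER_3*01"]

print(multiple_assignment_fraction(gene_based))
print(multiple_assignment_fraction(cluster_based))
```

Ties between alleles of the same gene or cluster are not counted, matching the definition of a sequence attributed to "more than a single gene/ASC".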
We then applied the ASC approach to other AIR-encoding genomic loci. We clustered the sets of functional alleles downloaded from IMGT (July 2022) for human IGKV, IGLV, TRBV, and TRAV. We applied the same thresholds of 75% and 95% for determining the allele families and ASCs (Figure 7). The IGK locus is unique because it harbors two large duplicated V gene blocks that are inversely oriented, separated by a large region enriched with complex repeats. Here, as in IGHV, some genes share alleles with identical sequences. As expected, these duplicated alleles are clustered together under the 95% threshold. A split is observed in IGKV1-17, whose alleles are assigned to two ASCs. In the IGL locus, where the IUIS nomenclature defines 10 subgroups, we found 12 families using our approach and thresholds. Four genes were combined into ASCs, and a single gene was split into two ASCs. The loci of TRB and TRA remained relatively constant, except for four TRBV genes, which were merged into two ASCs. We developed an interactive application that applies the ASC naming scheme to V allele reference sets from different loci and organisms, https://yaarilab.github.io/IGHV_reference_book/alleles_groups.html. As the reference set can change over time, we recommend not using the nomenclature in reporting but only in the downstream analyses. Nevertheless, for backtracking, reproducibility, and interoperability, we maintain a Zenodo archive (https://doi.org/10.5281/zenodo.7401189) of all ASC runs conducted by our web server. It allows translation of the allele cluster names into IUIS names and also into the unique names suggested in the supplementary materials (Supplementary Table S1).

DISCUSSION
Many studies have used repertoire sequencing data to explore the IG and TR loci using inference tools (17, 32, 33, 40-43). As a consequence, a plethora of allele sequences have been discovered (7-10, 24, 44-47). In many cases, alleles can be assigned to specific genes. However, in some cases, gene duplications and extensive inter-individual haplotype variations (including structural variants yet to be identified) present challenges for the accurate assignment of alleles to genes (27). In less documented species, genomic loci are often not extensively characterized (48). For example, rhesus macaque and mouse germline reference sets have been published as discrete sets of allele sequences without gene attributions (45, 49).
In this study, we report on two innovations that can be highly beneficial in such situations. The first is our proposed naming scheme that organizes alleles into clusters based on sequence similarity, which aids downstream analyses. The ASCs can be used for clonal inference, usage reporting, and genotype and haplotype inferences. We propose that the ASC naming scheme can be utilized for AIRR-seq analysis in understudied species until comprehensive characterizations of genomic loci are conducted. That being said, the proposed scheme is not meant to replace the existing IUIS naming, but rather to accompany it and allow for a more inclusive analysis. We created an R package (PIgLET, https://bitbucket.org/yaarilab/piglet), which includes an example dataset for testing purposes, and an online application within the ASC website (https://yaarilab.github.io/IGHV_reference_book/alleles_groups.html). This tool allows users to infer ASCs based on their own IGHV allele reference set and plot the ASC results. For better results, we recommend using the ASC germline reference set as a personalized database for aligner software such as IgBLAST. Further, we suggest that users archive their ASC germline reference set with a Zenodo DOI. We created an archive in Zenodo (10.5281/zenodo.7401239) for the ASC germline reference set presented here. The archive will be updated as we release new reference sets. As mentioned, a potential use case for our proposed naming scheme is for clonal inference methods. Many of the clonal inference methods rely on V gene assignments, which are impacted by similar genes and alleles. Therefore, utilizing the ASC approach may lead to better clonal inference performance.
The second innovation we report is a new and improved approach for inference of a personal genotype and for determination of VDJ allele usage from AIRR-seq data. The approach is based on the absolute frequency of allele usage within a specific population, rather than on relative usage, which is normalized at the gene level. We created an interactive website where each page shows the allele usage across the naive IGHV repertoires from the P1 and P11 studies of VDJbase. The site allows users to view the default ASC-based allele-specific thresholds, and to explore the implications of changing these values. Altering the default values is primarily advised for experienced users who wish to adapt these values for different populations or species. Our site will be continuously updated as more naive AIRR-seq and direct genomic sequencing datasets accumulate. Along with the site, the thresholds for allele detection in VDJbase will also be updated. Moreover, as new species are sequenced and published, we will include them in the site and update VDJbase accordingly.
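The difference between absolute and relative frequency cutoffs can be made concrete with a small Python sketch. It is a simplified illustration rather than the PIgLET procedure: the function names, the toy counts, and the per-allele threshold values are hypothetical, and the 12.5% relative cutoff is used only because it is the value quoted for the fraction-based comparison in Figure 3.

```python
from typing import Dict, Set


def genotype_absolute(counts: Dict[str, int], repertoire_size: int,
                      thresholds: Dict[str, float],
                      default: float = 1e-4) -> Set[str]:
    """Keep alleles whose share of the WHOLE repertoire reaches an
    allele-specific threshold (default value is an assumption)."""
    return {
        allele for allele, n in counts.items()
        if n / repertoire_size >= thresholds.get(allele, default)
    }


def genotype_relative(counts: Dict[str, int], cutoff: float = 0.125) -> Set[str]:
    """Keep alleles whose share WITHIN their gene/cluster reaches the cutoff."""
    totals: Dict[str, int] = {}
    for allele, n in counts.items():
        group = allele.split("*")[0]
        totals[group] = totals.get(group, 0) + n
    return {
        allele for allele, n in counts.items()
        if n / totals[allele.split("*")[0]] >= cutoff
    }


# Invented example: one cluster with a dominant and a weakly expressed allele.
counts = {"CLUSTER*01": 30, "CLUSTER*02": 2000, "CLUSTER*03": 3}
thresholds = {"CLUSTER*01": 1e-4, "CLUSTER*02": 1e-3, "CLUSTER*03": 1e-4}

abs_genotype = genotype_absolute(counts, repertoire_size=100_000, thresholds=thresholds)
rel_genotype = genotype_relative(counts)
print(sorted(abs_genotype))
print(sorted(rel_genotype))
```

In this toy case the weakly expressed allele *01 is retained by the absolute rule but masked by the dominant *02 under gene-level normalization, while the very rare *03 is rejected by both.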
Figure 7. Allele clusters for V genes from IGL/K and TRB/A loci. Each alluvial plot represents the cluster division for a given locus. The first row of each plot shows the division of the families, the second row the ASCs, and the third the IUIS gene clustering. The colors represent the allele clusters. White represents IUIS genes that have been re-clustered into more than a single allele cluster.
In summary, the ASC approach creates a clean dataset for analytics. It transforms an existing germline reference set into a set of sequences that correctly represent the assignments that can be made in a repertoire. This is analogous to the data preparation step in machine learning. When analysis is complete, results can be transformed back into the language of the existing data set, at which point the ambiguities imposed by the limitations of the experiment are made clear. As an example, consider an IGH repertoire derived from reads amplified with BIOMED-2 primers. Using the S2 germline reference set provided with the Reference Book, analysis might establish that allele IGHVF2-G10*09 is expressed at a high level compared to baseline. Translating back to IUIS nomenclature, the equivalent result is that 'a combination of alleles IGHV3-53*03 and IGHV3-53*02 are expressed at a high level relative to baseline'. It is not possible to tell from the repertoire whether the over-expression is related to allele *03 or *02, because they are indistinguishable in the reads, as the primers used mask differences between them. The ASC terminology correctly addresses this, without the need to manage the ambiguity explicitly in the analytics pipeline. Users can also apply the approach in circumstances where there is no existing gene-based germline reference set, by using the capabilities provided in PIgLET. This will create a gene-like reference set, and a baseline expression level for each allele. Because PIgLET's personalized genotyping is based on the expression levels of individual alleles, it can be used on repertoires prepared in this way, and hence improve the quality of assignments.
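The back-translation step can be pictured as a simple lookup from ASC allele names to one or more IUIS names. The Python sketch below is illustrative only: the table holds just the three name pairs quoted in this article (drawn from different reference sets), the function names are hypothetical, and a real analysis would load the full mapping from the archived ASC reference set rather than hard-code it.

```python
from typing import Dict, List

# Name pairs quoted in the text; NOT a complete or authoritative table.
ASC_TO_IUIS: Dict[str, List[str]] = {
    "IGHVF2-G10*09": ["IGHV3-53*02", "IGHV3-53*03"],  # S2 set example
    "IGHVS1F8-G39*03": ["IGHV4-34*01"],
    "IGHVS1F8-G36*03": ["IGHV4-59*08"],
}


def to_iuis(asc_allele: str) -> List[str]:
    """Return the IUIS allele name(s) recorded for an ASC allele name."""
    try:
        return ASC_TO_IUIS[asc_allele]
    except KeyError:
        raise KeyError(f"no IUIS translation recorded for {asc_allele!r}") from None


def describe(asc_allele: str) -> str:
    """Render the translation, making any ambiguity explicit."""
    names = to_iuis(asc_allele)
    if len(names) == 1:
        return names[0]
    return "one of " + ", ".join(names) + " (indistinguishable in these reads)"


print(describe("IGHVS1F8-G39*03"))
print(describe("IGHVF2-G10*09"))
```

Keeping the ambiguity in the lookup table, rather than in the analysis code, is what lets the pipeline work with a single unambiguous label until results are reported.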
We have demonstrated the application of ASC-based allele usage information to an allele expression analysis in AIRR-seq studies (Figure 4), and shown how it reduces the number of multiple gene assignments (Figure 6). The resulting allele usage vector provides a clear signal, tailored to the details of the underlying data set, which can be used in graphical reports or machine learning applications. The conclusions can be translated back to IUIS nomenclature.
It is known that some alleles of a gene may be expressed at higher levels than other alleles. An important feature of the ASC-based genotype method is its ability to accurately assess the validity of alleles with low expression. Identifying such alleles with low expression and including them in the genotype can be critical for investigating disease susceptibility (39, 43, 50-54).
We validated the ASC-based approach by comparing AIRR-seq inferred genotypes with a genotype based on direct long-read genomic sequencing (4). Even though some repertoires in these genomically sequenced cohorts had relatively low AIRR-seq depths, the comparison showed a strong concordance between the direct sequencing and the proposed inference method.
The set of unique alleles inferred from two large studies in VDJbase by our method constitutes less than 60% of the current IGHV reference available in IMGT. This again raises the interesting debate of whether all alleles in the existing reference set truly exist. This point was previously reviewed in (36), in which the authors discovered that several alleles were erroneous. On the other hand, many ethnic populations are understudied (35), and most likely harbor many more alleles yet to be discovered. We believe that these matters should be further discussed and reviewed to curate an optimal reference set for AIRR-seq analyses. As demonstrated by Rodriguez et al. (28), different ethnic backgrounds influence the IGH composition (i.e., genes, deletions, etc.). Our study is based on samples solely derived from a northern European setting. With more repertoire data curated from individuals with different ethnic backgrounds, the allele-specific thresholds might have to be tailored to the different populations. We envisage that with the rising interest in AIRR-seq, future studies will reveal more diversity, which will contribute to the efforts to enhance both the ASC website and VDJbase, and to the optimization of inferences and tools.

Figure 1. PIgLET workflow. The left upper panel illustrates the re-annotation and clustering inference steps. Our Program for Ig Allele Similarity Clusters and Genotypes (PIgLET) receives as input a reference set, and infers clusters using hierarchical clustering based on sequence similarity. Then an ASC-annotated reference set is obtained. The next step is re-annotating an AIRR-seq dataset with the reference set. The output of this process is an ASC-annotated AIRR-seq dataset (marked in orange). The right upper panel shows the ASC-based genotyping approach of PIgLET. PIgLET receives as input the ASC-annotated AIRR-seq data. The genotype is inferred based on allele-specific thresholds from the IGHV reference book. Finally, a genotype inference is generated (marked in green). The bottom panel outlines possible downstream analyses that can be conducted using PIgLET's output, with improved performance.

Figure 2. Allele Similarity Clusters. (A) Hierarchical clustering of the functional IGH germline reference set. The inner layer shows a dendrogram of the clustering. The dashed lines indicate the sequence similarity of 75% (light blue) and 95% (red). The dendrogram branches are colored by the 75% sequence similarity. The inner colored circle shows the clusters and alleles for the library amplicon length of S1, the middle circle for the length of S2, and the outer for S3. The white spaces in each circle indicate alleles that cannot be distinguished within each respective germline reference set. The arrows indicate the direction and the extent of allele collapsing, with respect to the adjacent inner circle. (B) An alluvial plot showing the connection between the S1 ASC and the IUIS genes. The colors represent the S1 ASC. White represents IUIS genes whose alleles are clustered into more than a single allele similarity cluster. (C) The frequency of the subgroups/families, genes/clusters, and alleles for each amplicon length. The categories along the x-axis reflect the amplicon lengths and the y-axis is the count of the unique subgroups/families, genes/clusters, or alleles.

Figure 3. Genotype inference comparison between the ASC-based and TIgGER fraction method. (A) A heatmap comparing the genotypes inferred from TIgGER (17) and from the ASC-based method. The bottom panel is the heatmap comparison, where each row is a genotype inference of an individual from the P1 dataset and each column is a different allele. The alleles shown in the heatmap are those for which at least one of the methods inferred a genotype. Black and pink represent alleles that entered the genotype only in the ASC-based method or only in TIgGER (17) with a 12.5% threshold, respectively. Green represents alleles that entered the genotype in both methods, and white represents alleles that did not pass in either method. The top panel is the summation of the heatmap events. The y-axis is the count of the individuals for whom a given allele entered the genotype. The x-axis is the different alleles. The colored strip bar shows the ASC group of each allele and matches the colors shown in Figure 2. (B) The relative and absolute frequencies of the ASC IGHVF2-G15. Each dot is an individual for whom allele 01 entered the genotype with the ASC-based method, but did not in TIgGER (17). The colors represent the different individuals. Each column is a different allele from the cluster. The top row is the absolute frequency and the bottom is the relative frequency. (C) Haplotype inference based on IGHJ6 for the individuals from (B). Each facet is a different individual, and the facet color matches the dots from (B). In each facet, the top row and orange color is the frequency for the IGHJ6*02 chromosome and the bottom and blue color for the IGHJ6*03 chromosome. The x-axis is the different alleles for the cluster, and the y-axis is the sequence count. (D) A histogram of the allele abundance distribution in the studied population. The x-axis is the number of individuals attributed to each allele, and the y-axis is the number of alleles.

Figure 4. Gene usage is associated with genotype. (A) The y-axis is the genotype allele combination frequencies of IGHVS1F2-G5, normalized by the number of sequences in the whole repertoire, for individuals in P1. The x-axis is the genotype allele combination, ordered by the number of individuals carrying the combinations (B). Each point is an individual, and colors represent the order of the genotype allele combination. A Tukey's HSD multiple comparison test was performed for groups containing four individuals or more. The adjusted p values are indicated on the connecting line; only the statistically significant combinations were drawn. ns: P > 0.05, * P ≤ 0.05, ** P ≤ 0.01, *** P ≤ 0.001, **** P ≤ 0.0001. (C) The genotype allele combination intersect matrix. Each row is a different allele and each column is a different genotype allele combination.

Figure 5. Genomic inference versus AIRR-seq genotype inference. A heatmap of a comparison between the ASC-based genotype method and genomic validation. Each row is an individual, and each column is a different allele. The alleles shown in the heatmap are those for which at least one of the methods inferred a genotype. Dark green and dark purple represent alleles that are only present in the ASC-based method or genomic validation, respectively. Light green represents alleles that are present in both methods, and white represents alleles that are not seen in either. Light purple represents alleles of genes that were resolved with manual inspection.

Figure 6. The effects of an incomplete germline reference set. (A) The heatmap shows the clusters based on the full germline used in Figure 2, and the re-clustering after the reduced germline, which includes alleles that entered the genotype using the ASC-based method on the P1 and P11 cohorts. The bottom row shows the clusters for the full germline reference set, and the top row shows the clusters for the new germline. The colors represent the different clusters. White represents alleles that did not enter the genotype. (B) Summation of the number of families, clusters, and alleles in the S1 germline and the reduced reference. The x-axis is the different reference sets and the y-axis is the count of the events. (C) The frequency of multiple cluster/gene assignments. The x-axis is the different reference sets and the y-axis is the absolute frequency of multiple assignments. Each dot is an individual's multiple assignment frequency from the non-naive P4 cohort.